Don't Use a Lot When Little Will Do: Genre Identification Using URLs
Authors
Abstract
The ever-increasing volume of data on the World Wide Web calls for the use of vertical search engines. Sandhan is one such search engine; it offers search in the tourism and health genres in more than 10 Indian languages. In this work we build a URL-based genre identification module for Sandhan. A direct application of this work is building focused crawlers to gather Indian-language content. We conduct experiments on tourism and health web pages in Hindi. We experiment with three approaches: list-based, naive Bayes, and incremental naive Bayes. We evaluate our approaches against another web page classification algorithm built on the parsed text of manually labeled web pages. We find that the incremental naive Bayes approach outperforms the other two. In our experiments we work with different features: words, n-grams, and all-grams. Using n-gram features we achieve classification accuracies of 0.858 and 0.873 for the tourism and health genres, respectively.
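As an illustration of the general idea (not the authors' implementation), the sketch below classifies URLs with a multinomial naive Bayes model over character n-grams of URL tokens. The tokenization, the choice of trigrams, the add-one smoothing, the class name `UrlNaiveBayes`, and the example URLs are all assumptions made for this example. The abstract does not define "incremental naive Bayes"; here it is taken to mean only that counts are updated one labeled URL at a time.

```python
import math
import re
from collections import Counter, defaultdict


def url_ngrams(url, n=3):
    """Split a URL into lowercase alphanumeric tokens, then into character n-grams.

    Tokens no longer than n characters are kept whole.
    """
    tokens = re.findall(r"[a-z0-9]+", url.lower())
    grams = []
    for tok in tokens:
        if len(tok) <= n:
            grams.append(tok)
        else:
            grams.extend(tok[i:i + n] for i in range(len(tok) - n + 1))
    return grams


class UrlNaiveBayes:
    """Multinomial naive Bayes over URL n-grams with add-one smoothing (hypothetical helper)."""

    def __init__(self, n=3):
        self.n = n
        self.class_counts = Counter()                # labeled URLs seen per genre
        self.feature_counts = defaultdict(Counter)   # n-gram counts per genre
        self.vocab = set()                           # all n-grams seen so far

    def update(self, url, label):
        """Fold one labeled URL into the counts; can be called at any time."""
        self.class_counts[label] += 1
        for gram in url_ngrams(url, self.n):
            self.feature_counts[label][gram] += 1
            self.vocab.add(gram)

    def predict(self, url):
        """Return the genre with the highest log-posterior for this URL."""
        total = sum(self.class_counts.values())
        grams = url_ngrams(url, self.n)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for gram in grams:
                score += math.log((self.feature_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: these URLs are made up for illustration only.
clf = UrlNaiveBayes(n=3)
clf.update("http://example.com/tourism/goa-travel-guide", "tourism")
clf.update("http://example.com/hotels/jaipur-tour", "tourism")
clf.update("http://example.com/health/diabetes-treatment", "health")
clf.update("http://example.com/hospital/heart-disease", "health")

print(clf.predict("http://example.org/travel/kerala-tour"))
print(clf.predict("http://example.org/health/heart-care"))
```

Because the model only stores counts, a crawler could keep calling `update` as newly labeled URLs arrive without retraining from scratch. Classifying from the URL alone also avoids fetching and parsing the page, which is the saving the title alludes to.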
Similar resources
How do Writers Present Their Work in Introduction Sections? A Genre-based Investigation into Qualitative and Quantitative Research Articles
Research articles have received wide interest in discourse studies, particularly in genre analysis, over the last few decades. A vast number of studies have centered on identifying the organizational patterns of research articles in various fields. While the Introduction section has received a lot of attention, very few studies have focused on the rhetorical structure of qualitative and quantitativ...
What’s in a URL? Genre Classification from URLs
The importance of URLs in the representation of a document cannot be overstated. Shorthand mnemonics such as “wiki” or “blog” are often embedded in a URL to convey its functional purpose or genre. Other mnemonics have evolved from use (e.g., a Wordpress particle is strongly suggestive of blogs). Can we leverage from this predictive power to induce the genre of a document from the representation...
-
The development and evolution of any system–person, organization–nation depends on how the system succeeds to bridge the gap between what the system knows and what the system does (with the knowledge). We call this the gap between knowing and doing or the knowing-doing gap. If the system does not do what it knows, it will lose out in competition with other systems, its relative performance in...
Feature-based Malicious URL and Attack Type Detection Using Multi-class Classification
Nowadays, malicious URLs are a common threat to businesses, social networks, net-banking, etc. Existing approaches have focused on binary detection, i.e. whether the URL is malicious or benign. Very little literature focuses on detecting malicious URLs together with their attack types. Hence, it becomes necessary to know the attack type and adopt an effective countermeasure. This pa...
Chemical and ecological control methods for Epitrix spp.
Very little information exists in regards to the control options available for potato flea beetles, Epitrix spp. This short review covers both chemical and ecological options currently available for control of Epitrix spp. Synthetic pyrethroids are the weapon of choice for the beetles. However, the impetus in integrated pest management is to do timely (early-season) applicatio...
Journal: Research in Computing Science
Volume: 70
Publication date: 2013